
ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE#19171

Open
jiangshhh wants to merge 1 commit into ggml-org:master from jiangshhh:sve-ggml_vec_dot_mxfp4_q8_0-opt

Conversation


@jiangshhh jiangshhh commented Jan 29, 2026

Proposal

This proposal introduces an ARM SVE-optimized implementation of ggml_vec_dot_mxfp4_q8_0 for the ggml/llama.cpp CPU backend.
The current implementation relies on scalar or NEON-based code paths, which do not fully utilize the wide vector capabilities available on modern ARM CPUs equipped with the Scalable Vector Extension (SVE). By leveraging SVE intrinsics (a minimal sketch of the vector-length-agnostic loop pattern is shown after the list below), this proposal aims to:

  1. Improve utilization of vector registers on SVE-capable platforms, independent of fixed vector widths
  2. Maintain numerical equivalence with the existing reference implementation
  3. Ensure portability across different SVE vector lengths
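
As a rough illustration of the intended approach, here is a minimal sketch (assumed names, not the PR's actual kernel) of the vector-length-agnostic SVE loop pattern, shown for a plain int8 dot product. The real ggml_vec_dot_mxfp4_q8_0 kernel additionally expands the 4-bit mxfp4 nibbles through a lookup table and applies the per-block scales.

```c
// Minimal sketch, not the PR's kernel: a vector-length-agnostic int8 dot
// product written with SVE ACLE intrinsics.
#include <arm_sve.h>
#include <stdint.h>

static int32_t dot_i8_sve(const int8_t * a, const int8_t * b, int n) {
    svint32_t acc = svdup_n_s32(0);

    // svcntb() reports the vector length in bytes at run time, so the same
    // binary adapts to 128-, 256- or 512-bit SVE implementations.
    for (int i = 0; i < n; i += (int) svcntb()) {
        const svbool_t pg = svwhilelt_b8_s32(i, n);  // predicate also covers the tail
        const svint8_t va = svld1_s8(pg, a + i);     // inactive lanes read as zero
        const svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);                // widening int8 -> int32 dot product
    }
    return (int32_t) svaddv_s32(svptrue_b32(), acc); // horizontal reduction
}
```

On a 512-bit implementation such as A64FX each iteration consumes 64 int8 elements, versus 16 for a single 128-bit NEON register.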

Verification

The proposed SVE implementation was verified with the following considerations:

  1. Functional Correctness
    Accumulation logic and scaling factors follow the original ggml_vec_dot_mxfp4_q8_0 definition.
  2. Architectural Safety
    The implementation uses SVE intrinsics only, without assuming a fixed vector length.
    The SVE path is guarded by __ARM_FEATURE_SVE to ensure it is compiled and executed only on supported hardware (see the guard-and-fallback sketch after this list).
  3. Fallback Compatibility
    Non-SVE platforms continue to use the existing scalar or NEON implementations without modification.
    The change does not affect other quantization paths.
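
For points 2 and 3 in particular, the structure below is a hedged sketch of the guard-and-fallback pattern (hypothetical function name and layout, not the actual ggml source):

```c
// Hypothetical sketch of the compile-time dispatch, not the actual ggml code:
// the SVE body exists only when the toolchain defines __ARM_FEATURE_SVE;
// every other build keeps the pre-existing scalar/NEON logic unchanged.
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
#if defined(__ARM_FEATURE_SVE)
    // SVE path: same vector-length-agnostic loop as in the proposal sketch
    svint32_t acc = svdup_n_s32(0);
    for (int i = 0; i < n; i += (int) svcntb()) {
        const svbool_t pg = svwhilelt_b8_s32(i, n);
        acc = svdot_s32(acc, svld1_s8(pg, a + i), svld1_s8(pg, b + i));
    }
    return (int32_t) svaddv_s32(svptrue_b32(), acc);
#else
    // Scalar reference path: exact integer accumulation, usable as the
    // correctness baseline for the SVE path.
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += (int32_t) a[i] * (int32_t) b[i];
    }
    return acc;
#endif
}
```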

Performance check

Performance was measured on an FX700 (A64FX) and improves as shown below; the values are tokens per second.

| Batch size | Original (NEON) | This PR (SVE) | Ratio |
| ---------: | --------------: | ------------: | ----: |
|          1 |            3.66 |          8.60 |  2.35 |
|          2 |            3.73 |          9.04 |  2.42 |
|          4 |            3.76 |          9.25 |  2.46 |
|          8 |            3.75 |          9.08 |  2.42 |

The command used to measure the performance is
llama-batched --model ${PATH_TO_MODEL} --prompt 'AI is going to' --parallel 8 --predict 128 --seed 0 --threads 48

@jiangshhh jiangshhh requested a review from ggerganov as a code owner January 29, 2026 06:48
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 29, 2026
@jiangshhh jiangshhh changed the title ggml: pptimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE Jan 29, 2026
@jiangshhh
Author

@ggerganov @slaren
Hi,

The PR introduces an ARM SVE optimization for ggml_vec_dot_mxfp4_q8_0, and I have verified correctness and performance on an SVE-capable platform.

This is my first PR to llama.cpp, so please let me know if there are any additional steps I should follow to start the review/approval process.

Thank you very much for your time and for maintaining this project.

@taronaeo
Collaborator

@Alcpz By any chance do you have ARM SVE hardware to test and review this? :)

@Alcpz
Collaborator

Alcpz commented Jan 29, 2026

> @Alcpz By any chance do you have ARM SVE hardware to test and review this? :)

Unfortunately no, I would be happy to help otherwise.

@taronaeo
Collaborator

taronaeo commented Feb 6, 2026

I've just spun up an AWS c8gn.2xlarge that has SVE and SVE2 to test this and I can't seem to reproduce the same result that you're getting. Am I missing something?

upstream/master

$ build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf 
| model                 |      size |  params | backend | threads |  test |          t/s |
| --------------------- | --------: | ------: | ------- | ------: | ----: | -----------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | pp512 | 49.33 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | tg128 | 29.12 ± 0.01 |

build: f9bd518 (7955)

pr/19171

$ build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf 
| model                 |      size |  params | backend | threads |  test |          t/s |
| --------------------- | --------: | ------: | ------- | ------: | ----: | -----------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | pp512 | 47.07 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | tg128 | 28.82 ± 0.06 |

build: 18ad28c (7870)

gcc dump

$ echo | gcc -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme -dM -E - | grep __ARM_FEATURE_SVE
#define __ARM_FEATURE_SVE_BITS 0
#define __ARM_FEATURE_SVE_VECTOR_OPERATORS 1
#define __ARM_FEATURE_SVE2_AES 1
#define __ARM_FEATURE_SVE 1
#define __ARM_FEATURE_SVE2_SHA3 1
#define __ARM_FEATURE_SVE_MATMUL_INT8 1
#define __ARM_FEATURE_SVE_BF16 1
#define __ARM_FEATURE_SVE2 1
#define __ARM_FEATURE_SVE2_BITPERM 1

lscpu

$ lscpu | grep -E "Model name|Flags"

Model name:                              Neoverse-V2
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti

@jiangshhh
Author

@taronaeo
Thank you very much for your careful review and for pointing out the performance difference between FX700 and Graviton4.

Regarding the SVE optimization for mxfp4, we initially observed approximately 2x performance improvement on FX700 (A64FX), while no significant speedup was observed on Graviton4 (Neoverse V2). This behavior is consistent with the underlying SIMD microarchitecture.

As summarized below, the architectures differ in the following parameters:

  • SVE vector length x number of SVE pipelines
  • NEON 128-bit width x number of NEON pipelines

A64FX (FX700)

  • SVE configuration: 2 x 512-bit
  • NEON configuration: 2 x 128-bit
  • SVE FP64 FMLA peak: 32
  • NEON FP64 FMLA peak: 8

Here, SVE provides a clear width and throughput advantage over NEON: moving from NEON to SVE can theoretically deliver up to ~4x the compute width. In practice, we observed approximately a 2x speedup, which is consistent with architectural expectations.

Neoverse V2 (Graviton4/NVIDIA Grace)

  • SVE configuration: 4 x 128-bit
  • NEON configuration: 4 x 128-bit
  • SVE FP64 FMLA peak: 16
  • NEON FP64 FMLA peak: 16

In this case, the effective SIMD throughput of SVE and NEON is architecturally equivalent. Although SVE provides a more flexible programming model, the raw vector width and pipeline count are effectively the same as NEON.
Therefore, equal performance (rather than 2x speedup) is the expected result on Neoverse V2-based systems.
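
As a back-of-the-envelope check of the peak figures listed above (assuming they are FP64 FLOPs per cycle per core, with a fused multiply-add counted as 2 FLOPs per lane):

```math
\text{A64FX: } \frac{2 \times \tfrac{512}{64} \times 2}{2 \times \tfrac{128}{64} \times 2} = \frac{32}{8} = 4\times
\qquad
\text{Neoverse V2: } \frac{4 \times \tfrac{128}{64} \times 2}{4 \times \tfrac{128}{64} \times 2} = \frac{16}{16} = 1\times
```

This matches the ~4x theoretical headroom for SVE on A64FX and the ~1x expectation on Neoverse V2.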

Additional Measurement on NVIDIA Grace

After the latest refinement of the implementation, we re-measured performance on NVIDIA Grace (Neoverse V2) using llama-bench (8 threads, 512 prompt tokens, 128 generation tokens, 5 repetitions).
Before (baseline build 7955)

| model                 |      size |  params | backend | threads |  test |          t/s |
| --------------------- | --------: | ------: | ------- | ------: | ----: | -----------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | pp512 | 39.49 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | tg128 | 28.12 ± 0.01 |

After (PR build 7957)

| model                 |      size |  params | backend | threads |  test |          t/s |
| --------------------- | --------: | ------: | ------- | ------: | ----: | -----------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | pp512 | 59.02 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU     |       8 | tg128 | 36.97 ± 0.03 |

Summary

On A64FX (512-bit SVE) → SVE has a clear hardware throughput advantage over NEON → 2x speedup observed.
On Neoverse V2 (Graviton4/NVIDIA Grace) → SVE and NEON have equivalent peak SIMD throughput → large speedup is not theoretically expected.
On NVIDIA Grace, after the refinement of the SVE implementation, the benchmark results above show roughly a 1.3–1.5x improvement (59.02/39.49 for pp512, 36.97/28.12 for tg128), which comes from better implementation efficiency rather than wider SIMD capability.

I hope this clarifies the architectural reason behind the observed performance differences.
Please let me know if anything further is needed.

Thank you again for the valuable feedback.
